I.  INTRODUCTION


A.  MOTIVATION 

1.  Social  Media  and  Twitter 

The  Internet  has  become  indispensable  in  our  daily  lives.  It  allows  us  to  create 
and  maintain  communication  and  collaboration;  through  the  use  of  social  media  and 
Twitter  we  can  share  everything  from  special  milestones  to  routine  daily  activities. 

According  to  statistics  [1],  there  are  nearly  3.010  billion  active  Internet  users, 
which  is  nearly  41%  of  the  world  population,  and  2.078  billion  active  social  media 
accounts.  The  annual  growth  is  21%  for  Internet  users  and  12%  for  active  social  media 
accounts  (Figures  1  and  2). 


Figure 1. Snapshot of the World's Key Digital Statistical Indicators. Source: [1].
[Panels: total population (urbanisation 53%); active Internet users (penetration 42%); active social media accounts (penetration 29%); unique mobile users (penetration 51%); active mobile social accounts (penetration 23%).]

Figure 2. Annual Growth of the Digital World. Source: [1].
[Panels: year-on-year growth in total population, active Internet users, active social media accounts, unique mobile users, and active mobile social accounts.]


Twitter is an Internet-based social networking service. It allows registered
users to compose and send short messages of up to 140 characters, called tweets,
which connect users to one another [2]. It also allows unregistered users to read
tweets sent by others. We focus on Twitter in this thesis because it is one of the
largest social media sites, with 645,750,000 registered users [3], and because its
public tweets are openly available for data mining.

2.  Malicious  Users  and  Tweets 

In the modern world, we face a new generation of terrorists, such as ISIS, who
use social media, especially Twitter, to recruit new members. Propagandists can spread
up to 200,000 tweets per day, using the platform as a propaganda tool with videos and
pictures to inflame the resentment of their supporters, spread fear among innocent
people, and trigger a series of lone-wolf terrorist attacks on orders from a terrorist
leader on another continent [4].

Social  media  is  a  convenient  place  for  malicious  users  for  several  reasons: 

1.  Detection is easily avoided.

2.  Many  people  can  be  accessed  with  little  effort. 

3.  Users  in  social  media  can  share  personal  information  such  as  credit 
cards,  passwords,  and  private  data. 

4.  It is easy to manipulate people by using popular content,
honeypots, and familiar accounts.

According to [5], Twitter suspended nearly 125,000 accounts in the seven
months leading up to February 2016 for terrorism-related activity. With newly recruited
staff, Twitter has been reducing the time it takes to detect malicious users and tweets
and to suspend the accounts. However, Twitter has said that hunting terrorists on the
web is not simple, since there is no “magic algorithm” for detecting terrorist content.

In [6], Berger and Morgan give detailed information about the features of
ISIS-supporter accounts, their tweeting patterns, and other Twitter metrics. Some of
these metrics can be used in our thesis to specify the patterns of malicious users. They are:

1.  For  about  one  out  of  five  ISIS-supporter  accounts,  the  primary  language  is 
English  (73%  Arabic,  18%  English,  and  6%  French).  It  is  common  for  the 
tweet  to  have  an  English  hashtag  with  Arabic  content. 

2.  The average number of followers of these accounts is about 1,004. A
typical user has 208 followers, whereas celebrities have millions. ISIS
supporters, therefore, have more followers than ordinary users, but it is
very unlikely to see an ISIS supporter with more than 20,000 followers.

3.  The  top  locations  of  the  accounts  are  Syria,  Iraq,  and  Saudi  Arabia.  The 
location  information  of  the  accounts  is  very  important  for  detecting  these 
accounts.  However,  only  a  few  users  have  enabled  the  location  features.  In 
addition,  some  of  the  users  are  using  other  applications  to  distort  the  GPS 
coordinates in creating the tweet.

4.  Botnets and applications have been used to propagate a large number of
tweets with specific content.


5.  According to the statistics, nine out of ten ISIS-supporter accounts
follow fewer than 1,000 users. This is a weaker indicator than the follower
count, but it can be useful for increasing precision [6].

Based on [6], we make the following assumptions to help us identify malicious tweets:

1.  The  owners  of  the  regular  tweets  have  longer  account  lives  than  malicious 
users  who  tend  to  close  their  accounts  for  reasons  such  as  hiding  their  real 
identity  or  accessing  different  users.  Moreover,  malicious  accounts  can  be 
closed  by  Twitter  after  malicious  behaviors  have  been  detected. 

2.  The  users  spreading  malicious  tweets  have  more  friends  than  followers 
since  they  want  to  access  more  people;  however,  they  are  not  known  by 
others  in  the  network. 

3.  The accounts do not enable geolocation.

4.  While  a  high  number  of  followers  (>20,000)  indicates  celebrities,  a  low 
number  (<500)  indicates  regular  users.  Suspicious  users  are  usually 
between  them. 

5.  Malicious users follow fewer than 1,000 users.
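The thresholds in assumptions 3 to 5 can be written down as a small predicate. The following is only a minimal sketch: the Account class and the function names are hypothetical and are not part of the thesis toolset, and the sketch deliberately leaves out assumptions 1 and 2, which need account-lifetime data and a friends-to-followers comparison.

```python
from dataclasses import dataclass

# Thresholds taken from the assumptions listed above.
CELEBRITY_MIN_FOLLOWERS = 20_000
REGULAR_MAX_FOLLOWERS = 500
MAX_FRIENDS = 1_000


@dataclass
class Account:
    """Hypothetical container for the account features used above."""
    followers_count: int
    friends_count: int
    geo_enabled: bool


def follower_band(followers_count: int) -> str:
    """Map a follower count to the band named in assumption 4."""
    if followers_count > CELEBRITY_MIN_FOLLOWERS:
        return "celebrity"
    if followers_count < REGULAR_MAX_FOLLOWERS:
        return "regular"
    return "suspicious-range"


def matches_assumptions(account: Account) -> bool:
    """True when assumptions 3, 4, and 5 all hold for the account."""
    return (
        not account.geo_enabled
        and follower_band(account.followers_count) == "suspicious-range"
        and account.friends_count < MAX_FRIENDS
    )
```

A match is only a reason to look more closely at an account; by itself it does not establish that the account is malicious.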

B.  PURPOSE  OF  THE  STUDY 

Social  media  is  a  rapidly  growing  and  changing  space.  It  is  a  data  pool,  in  that  a 
wide  variety  of  information  can  be  captured  as  long  as  one  knows  how.  This  situation 
encourages  us  to  adopt  a  new  technique  for  the  detection  of  security-related  malicious 
activities.  In  this  thesis,  we  will  use  a  hybrid  of  probabilistic  and  formal  methods  to 
detect  malicious  activities. 

C.  THESIS  STRUCTURE 

Chapter I provides the motivation for this thesis. It explains the importance of
social media in our lives. Chapter II provides background information on formal
methods, formal specifications, the Hidden Markov Model, runtime monitoring, and
verification. This chapter is important for understanding the new system, which builds
on these techniques. Chapter III describes the steps for collecting, filtering,
and reformatting the data acquired from Twitter. It provides useful information for those
who want to mine Twitter data, and presents the natural language assertions and
corresponding rule patterns. It then describes the steps performed, using screenshots from
the toolset. The last chapter, Chapter IV, discusses thoughts and implications regarding
the new technique.

II.  BACKGROUND


A.  NATURAL  LANGUAGE  REQUIREMENTS 

The typical software development process consists of requirements, design, and
implementation stages. Before applying any formal method, software developers try to
state in natural language what they understand of the requirements and of the
environment. Requirements are essential for a proposed system, since the system
development cycle is cumulative. If a developer misses a requirement, fixing the
resulting error can be costly. Once stakeholder and system needs have been specified as
requirements in natural language, system developers should translate them into formal
specifications so that the problem can be solved in a systematic manner [7].

One may ask, “If we need to turn requirements into formal specifications, why do
we use natural language in the initial stage?”, as is done in this project. The answer is
that natural language is necessary because stakeholders, customers, or prospective users
probably do not understand formal specifications. Another reason is that nobody
wants to sign a contract written in a formal specification language [8]. Natural language
specification is therefore vital for ensuring that everyone is on the same page.

B.  FORMAL  METHODS  AND  FORMAL  SPECIFICATIONS 

The  growing  use  of  software-intensive  systems  has  increased  the  complexity  of 
software  development.  This  complexity  multiplies  the  likelihood  of  errors  and  increases 
development  cost.  The  major  goal  of  software  development  is  to  develop  reliable, 
efficient  software-intensive  systems  despite  the  growing  complexity.  At  this  point,  formal 
methods  (FM)  depending  on  a  well-designed  mathematical  structure  are  very  useful  for 
solving  the  problem  and  providing  precise  implementations.  FMs  are  a  general  collection 
of  techniques  including  formal  specification  (FS)  and  program  verification  [9]. 

FS is a technique based on a mathematical structure. It consists of syntax for
grammatical rules and semantics for interpretation [10]. The purpose of the FS is to
clearly represent a cognitive or natural language requirement and make it easy to monitor
system behaviors [11].

The  use  of  formal  methods  and  specifications  has  changed  over  time  as  a  result  of 
the  growing  complexity  of  systems.  Because  full  formalization  of  systems  has  become 
more  expensive  and  difficult,  some  scientists  have  proposed  lightweight  formal  methods 
focused  on  partial  specification  and  application  [12]. 

Lightweight formal methods are very useful for reducing development cost.
Moreover, FSs and lightweight FMs improve the clarity and precision of specified
requirements [13]. Runtime execution monitoring (REM) is a lightweight FM that
monitors the behavior of a running system and can detect improper behavior and
improper requirements in the early stages of development. Debugging the requirements
and detecting errors early in the design process save cost and time.

In [13], Drusinsky presents a new type of formal method that combines
runtime monitoring, execution-based model checking, and UML-based formal
specifications with statechart assertions, which provide an unambiguous, clear, and
visual presentation of the model. This new formalism uses deterministic and
nondeterministic statechart assertions as its specification language [13]. Figure 3
shows a statechart assertion with a start state, a final state, an event state, a timer,
and transitions between them.


Figure 3. A Statechart Assertion.

C.  DETERMINISTIC RUNTIME VALIDATION AND VERIFICATION

The correctness of a system is directly related to its validation and verification.
Verification checks whether the system produces correct results given its input. It
does so by comparing the expected output to the actual output; an inconsistency means
that verification has failed. Validation checks whether the system meets its intended
purpose: is it the right product to build? While the verification process focuses on
internal aspects such as the behaviors of the system, the validation process checks the
overall system as a final product and compares it to the intended product.

D.  FORMAL  VERIFICATION  AND  VALIDATION  TRADEOFF  SPACE 

As Drusinsky et al. point out in [7], verification and validation (V&V) techniques
involve tradeoffs in cost and coverage. We describe the tradeoffs by using two
three-dimensional (3D) cuboids whose dimensions are specification/validation,
program/implementation, and verification. Figures 4 and 5 represent the coverage
space and the cost space of V&V techniques, respectively.

Three  types  of  V&V  techniques  are:  theorem  proving  (TP);  model  checking  or 
property  checking  (MC);  and  execution-based  model  checking  (EMC)  combining 
runtime  verification  (RV)  with  automatic  test  generation  (ATG). 

1.  Theorem  Proving 

Higher Order Logic (HOL) and the Stanford Temporal Prover (STeP) are examples of
TP tools. TP employs mathematics-based proof techniques to provide a persuasive
argument that a program complies with its requirements. This technique requires a
human driver, because the underlying problem is generally undecidable.

With respect to cost and coverage tradeoffs, an important aspect of TP is that it
requires a human operator whose necessary skill level depends on the choice of
specification language [14]. The specification languages used in TP, such as temporal
logic, are generally difficult to understand and use. They are hard to apply because
they differ from the languages that software programmers use, and it is difficult to
validate formal specifications with limited knowledge of temporal logic syntax. TP
therefore has low specification coverage and high specification cost. TP depends on
programming languages that are in many ways inconsistent with existing Java or C++
applications, so it has low program coverage and high implementation cost in the
program/implementation dimension. While the expertise that TP requires causes high
verification cost, the involvement of highly skilled users provides high verification
coverage [7].

2.  Model  Checking 

MC is an algorithmic formal verification (FV) technique that exhaustively checks the
state space of a model to determine whether it satisfies the given specifications.
Published MC tools include the SPIN Model Checker and the Spatial Logical Model
Checker (SPML), which verify the correctness of distributed models.

Once a program is set up for MC, there is no need for an expert human operator.
In contrast to TP, MC therefore does not require human expertise and has a lower
verification cost. Both TP and MC have text-based specifications, which make
visualization and validation difficult for the designer. Consequently, MC has low
specification coverage, similar to TP [7]. With respect to the program/implementation
dimension, the most vulnerable point of this verification technique is the state-space
explosion (a.k.a. combinatorial explosion) problem, which typically causes the
verification process to be overwhelmed by an exponentially growing state space [15].
Overall, while the technique has low coverage and high cost in the
program/implementation dimension, it has high coverage and low cost in the verification
dimension because of its automatic verification and full coverage of components.

3.  Execution-Based  Model  Checking  (EMC) 

Runtime verification (RV) and automatic test generation (ATG) are the components
of EMC. RV tools include DBRover and StateRover, which use statechart diagrams as
their specification language. StateRover also has an automatic test generator.

In runtime verification, the system is monitored while it runs the tests generated
by ATG. With this technique, the UML-based StateRover specification language, a
dynamic, lightweight V&V tool, can be used. It is automated and can cope with large,
complicated systems. Hence, it has high coverage and low cost in the specification and
program/implementation dimensions. Reliance on ATG is the one weakness of this
technique: a violation can be missed if ATG cannot generate a suitable test for it. The
technique therefore has low cost and intermediate coverage in the verification
dimension [15]. Although RV has less coverage than conventional verification techniques
like MC and TP, it spares us the complexity of the verification process [16].


Figure 4. Coverage Space. Source: [7].

Figure 5. Cost Space. Source: [7].

Compared with the other two V&V techniques, EMC has lower implementation,
verification, and specification costs and higher program and specification coverage.

Note  that  we  do  not  verify  any  software  systems  in  our  thesis.  We  use  a  powerful 
FS  language  with  runtime  monitoring  in  EMC  since  runtime  monitoring  satisfies  our 
needs  for  situational  awareness,  which  is  a  prerequisite  for  stable  and  reliable  systems. 
The  method  ensures  more  consistent  and  trustworthy  results  for  categorization  of  tweets 
in  a  running  system. 

E.  HIDDEN  MARKOV  MODEL  (HMM) 

Markov models are stochastic models used for randomly changing systems.
They have a list of possible states, and each present state has one or more possible
future states. Four main Markov models are used in various problem areas [17]: the
Markov chain, the Markov decision process, the HMM, and the partially observable
Markov decision process (Table 1).

Table 1. Markov Models. Source: [17].

                        System state is fully      System state is partially
                        observable                 observable
System is autonomous    Markov chain               Hidden Markov model
System is controlled    Markov decision process    Partially observable Markov
                                                   decision process

In simpler Markov models such as the Markov chain, there is only one
parameter, the state transition probabilities, and states are fully observable. In an
HMM, the states are not fully visible, and each state has possible observations, state
transition probabilities, and output probabilities [18].


Figure  6.  Markov  Model.  Source:  [18]. 


In  Figure  6,  X,  y,  a,  and  b  indicate  states,  possible  observations,  state  transition 
probabilities,  and  observation  probabilities,  respectively. 

An HMM is very similar to a state machine: both have states and transitions
between states. Each transition between states and each observable of a state is
assigned a probability value between 0 and 1. The model determines one of the possible
outputs by looking at the sequence of observables [19]. In our thesis, the hidden states
of the model are the categories of the tweets. We will look for patterns, which are
expressed as rules in Chapter III, to flag interesting sequences of tweets.
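To make the relation between observations and hidden categories concrete, the sketch below decodes the most probable sequence of hidden states with the standard Viterbi algorithm. It is an illustration only: the three observation symbols and every probability value are invented for this example and are not the parameters learned or used in this thesis.

```python
# Hidden states are tweet categories; observations are a coarse, hypothetical
# feature of each tweet. All probabilities are invented for illustration.
STATES = ("B", "S", "M")  # benign, suspicious, malicious
START = {"B": 0.80, "S": 0.15, "M": 0.05}
TRANS = {
    "B": {"B": 0.85, "S": 0.10, "M": 0.05},
    "S": {"B": 0.30, "S": 0.50, "M": 0.20},
    "M": {"B": 0.10, "S": 0.30, "M": 0.60},
}
EMIT = {
    "B": {"plain": 0.70, "link": 0.25, "flagged_term": 0.05},
    "S": {"plain": 0.30, "link": 0.50, "flagged_term": 0.20},
    "M": {"plain": 0.10, "link": 0.30, "flagged_term": 0.60},
}


def viterbi(observations):
    """Return the most probable hidden-state sequence for the observations."""
    # best[s] = (probability of the best path ending in state s, that path)
    best = {s: (START[s] * EMIT[s][observations[0]], [s]) for s in STATES}
    for obs in observations[1:]:
        step = {}
        for s in STATES:
            # Choose the predecessor that maximizes the path probability.
            prob, path = max(
                ((best[p][0] * TRANS[p][s], best[p][1]) for p in STATES),
                key=lambda item: item[0],
            )
            step[s] = (prob * EMIT[s][obs], path + [s])
        best = step
    return max(best.values(), key=lambda item: item[0])[1]
```

The decoded sequence plays the role of a hidden-state label per tweet, which is the kind of label the rules in Chapter III refer to as HiddenState.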

F.  RUNTIME MONITORING AND VERIFICATION OF SYSTEMS WITH HIDDEN DATA

Runtime  monitoring  (RM)  is  a  technique  for  observing  runtime  system  behavior. 
While  doing  so,  it  detects  formal  specification  violations. 
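As a generic illustration of the idea, the sketch below monitors a stream of timestamped events against one simple timed requirement: every "request" must be answered by a "response" within a fixed timeout. The requirement, the event names, and the class are hypothetical and are not taken from StateRover or from this thesis; the sketch only shows how a monitor keeps state, tracks a deadline, and records violations while the system runs.

```python
class ResponseDeadlineMonitor:
    """Flags a violation when 'response' does not follow 'request' in time."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.state = "idle"      # either "idle" or "waiting"
        self.deadline = None
        self.violations = []     # timestamps at which a violation was noticed

    def on_event(self, name, timestamp):
        # The monitor has no clock of its own, so a missed deadline is
        # noticed when the next event arrives after the deadline.
        if self.state == "waiting" and timestamp > self.deadline:
            self.violations.append(timestamp)
            self.state = "idle"
        if name == "request" and self.state == "idle":
            self.state = "waiting"
            self.deadline = timestamp + self.timeout
        elif name == "response" and self.state == "waiting":
            self.state = "idle"


monitor = ResponseDeadlineMonitor(timeout=5)
for event_name, event_time in [("request", 0), ("response", 3),
                               ("request", 10), ("response", 20)]:
    monitor.on_event(event_name, event_time)
```

In this trace the first request is answered in time, while the second response arrives after its deadline, so the monitor records one violation.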

When RM and RV are applied to complex systems, the required information, such as
the existence of malicious email or tweets, is not fully observable. In [20], Drusinsky
presents an RM technique that can be implemented in a system that includes hidden
events. The technique uses UML-based statechart assertions; it combines HMM with RM
of formal specification assertions [20].

In this study, we combine HMM with RM of statechart assertions, where the
HMM is used to categorize tweets as malicious, suspicious, or benign.

The flow chart of a system using RM with hidden data processing is shown in
Figure 7.

Figure 7. Flow Chart. Adapted from [20].

III.  RUNTIME MONITORING OF TWEETS


A.  COLLECTING  AND  FILTERING  DATA 

The Twitter Streaming API was used to collect data from Twitter. First, we
created an account and generated tokens to be used as user credentials. Figure 8 shows
the Python code used for downloading tweets [21]. We assigned the unique user
credentials provided by Twitter to the variables access_token, access_token_secret,
consumer_key, and consumer_secret.


# Code modified from "Connecting to Twitter Streaming API and downloading data" code
# obtained from http://adilmoujahid.com/posts/2014/07/twitter-analytics/
# Note: StreamListener belongs to the tweepy 3.x API; it was removed in tweepy 4.0.
import json
import time
import pandas as pd
import matplotlib.pyplot as plt

# Import the necessary methods from the tweepy library
from tweepy import Stream
from tweepy import OAuthHandler
from tweepy.streaming import StreamListener

# Variables that contain the user credentials to access the Twitter API
# (values left blank here)
access_token = ""
access_token_secret = ""
consumer_key = ""
consumer_secret = ""


class listener(StreamListener):
    def on_data(self, data):
        # Print each raw JSON message received from the stream
        print(data)
        return True

    def on_error(self, status):
        print(status)
        return True  # do not kill the stream

    def on_timeout(self):
        return True  # do not kill the stream


auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
twitterStream = Stream(auth, listener())
# Bounding box covering the whole globe; English-language tweets only
twitterStream.filter(locations=[-180, -90, 180, 90], languages=['en'])


Figure  8.  Python  Code  to  Download  Data  from  Twitter.  Source:  [21]. 


We collected 22,000 tweets from publicly available Twitter data in the JSON data
structure. In this form, the data is not reader-friendly and is unsuitable for validation
and formal specifications. Moreover, each tweet has too many features. Hence, it is
necessary to filter the tweets for specific features of interest, such as when the user
account was created, the number of followers and friends, the number of retweets, the
text, and so on. The data was converted into CSV format and filtered by using the Google
Refine tool, which is a powerful tool for working with messy data. Google Refine can
filter the data and transform it into another format. Figure 9 shows a snippet of data,
including columns and records, in Google Refine.
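The conversion described above was done with Google Refine. Purely as an illustration of the same filtering step, the sketch below reads one JSON message per line, keeps only the features of interest, and writes them as CSV. The JSON field names follow the Twitter v1.1 tweet object; the output column names follow the columns shown in the figures of this chapter. The function names are hypothetical, and the sketch is not the procedure used in this thesis.

```python
import csv
import json

# Output column name -> path of the field inside a tweet's JSON object.
COLUMNS = {
    "tweet_created_at": ("created_at",),
    "tweet_id": ("id",),
    "text": ("text",),
    "retweeted": ("retweeted",),
    "user_id": ("user", "id"),
    "user_created_at": ("user", "created_at"),
    "followers_count": ("user", "followers_count"),
    "friends_count": ("user", "friends_count"),
    "user_verified": ("user", "verified"),
    "geo_enabled": ("user", "geo_enabled"),
}


def extract_row(tweet):
    """Pick the features of interest out of one decoded tweet."""
    row = {}
    for column, path in COLUMNS.items():
        value = tweet
        for key in path:
            value = value.get(key) if isinstance(value, dict) else None
        row[column] = value
    return row


def json_lines_to_csv(lines, out_file):
    """Write one CSV row per tweet and return the number of rows written."""
    writer = csv.DictWriter(out_file, fieldnames=list(COLUMNS))
    writer.writeheader()
    count = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the input
        message = json.loads(line)
        if "text" not in message:
            continue  # not a tweet, e.g. a delete or limit notice
        writer.writerow(extract_row(message))
        count += 1
    return count
```

A typical call would pass an open file of collected messages as `lines` and a file opened for writing with `newline=""` as `out_file`.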


Figure 9. A Snippet of Data in Google Refine.
[Screenshot of Google Refine showing 9,829 records with the columns tweet_created_at, tweet_id, text, retweeted, user_id, and user_created_at.]


B.  MEANING  OF  THE  DATA  COLUMNS 

We  used  the  variables  shown  as  columns  in  Figure  10.  The  meaning  of  each 
column  is  defined  in  Table  2. 


user_created_at  text                                      followers_count  friends_count  user_verified  geo_enabled
2009-03-13       Super Lol https:\/\/t.co\/lrj3JaAEjn      2946             1506           FALSE          TRUE
2009-04-18       @BluthX @joanwalsh How is stating fa      219              740            FALSE          TRUE
2009-06-03       23. I use to be the captain of SCTCC LYV  204              149            FALSE          TRUE
2010-01-14       The seats are being filled ahead of the   265880           420            TRUE           TRUE
2010-05-15       @staceymurdoughl thanks chick! I'll ta    1226             828            FALSE          TRUE
2010-05-18       @The Gatorr get on damn lol               695              2              FALSE          FALSE
2010-07-12       @TT Sisters @LittleMeThatter @GaryB       204              149            FALSE          FALSE
2010-07-13       @Theominiking\nYou was absolutely         580              1              FALSE          FALSE
2010-07-19       talayelhttp                               935              24             FALSE          FALSE

Figure 10. A Snippet of Data Columns Used in Thesis.

Table 2. Meaning of the Columns.


Column           Meaning
user_created_at  The date when the user account of the tweet was created.
text             The message sent through the Internet; the main body of the tweet.
followers_count  The current number of followers for this account.
friends_count    The number of users that this account is following.
user_verified    If set to True, then Twitter has officially certified the user’s identity.
geo_enabled      If set to True, the user has enabled Twitter to access location information.

In Table 3, we classify all variables as alpha or beta columns, since different
steps of our workflow use different columns. For example, the user_verified and
geo_enabled columns are used only in the learning phase of the workflow.


Table 3. Alpha and Beta Columns.

Type            Column name       Stage in which the columns are used
Alpha columns   followers_count   Used in the learning phase for determining
                friends_count     the HiddenState column.
                user_verified
                geo_enabled
Beta columns    user_created_at   Used in the R4B and RM phases.
                text
                followers_count
                friends_count
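The grouping in Table 3 amounts to selecting two overlapping subsets of columns from each record. A minimal sketch follows, assuming each record is a dictionary keyed by column name; the select helper is hypothetical and only illustrates the split.

```python
# Column groups as listed in Table 3.
ALPHA_COLUMNS = ("followers_count", "friends_count", "user_verified", "geo_enabled")
BETA_COLUMNS = ("user_created_at", "text", "followers_count", "friends_count")


def select(row, columns):
    """Return only the named columns of a row (a dict keyed by column name)."""
    return {name: row[name] for name in columns}


# One record with the values of the first row shown in Figure 10.
row = {
    "user_created_at": "2009-03-13",
    "text": "Super Lol",
    "followers_count": 2946,
    "friends_count": 1506,
    "user_verified": False,
    "geo_enabled": True,
}
alpha_view = select(row, ALPHA_COLUMNS)  # input to the learning phase
beta_view = select(row, BETA_COLUMNS)    # input to the R4B and RM phases
```

Note that followers_count and friends_count appear in both groups, so the two views are not disjoint.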

C.  NATURAL  LANGUAGE  ASSERTIONS  AND  DETERMINISTIC  RULE 
DEVELOPMENT 

The first step in developing an RM system for tweets is to specify
requirements in natural language (NL) and determine the corresponding formal
specifications (FS) [10]. We use Rules4business (R4B) for formal specifications. The
requirements in NL are expressed by patterns provided in R4B.

For determining malicious tweets, we made some assumptions based on Chapter I
(Motivation), as follows.

The owners of regular tweets have a longer account life than malicious users,
who tend to close their accounts for reasons such as hiding their real identity or
reaching different users. Moreover, malicious accounts can be deleted by Twitter.

In [22], Shingh, Bansal, and Sofat point out that malicious users reach a large
number of friends in a short time and use popular links to spread their tweets faster and
attract attention. Famous users on Twitter, on the other hand, have a larger number of
followers and a smaller number of friends. Therefore, we can assume that a large number
of followers (>20,000) indicates celebrity status, whereas a low number (<500) indicates
a regular user. Suspicious users usually fall between these two values in number of
followers. In addition, malicious users follow fewer than 1,000 users (friends<1000).

According  to  our  inferences  from  [4],  [5],  and  [6],  malicious  users’  accounts  are 
not  verified  by  Twitter  and  the  users  often  disable  the  geolocation  option.  Their  account 
life  is  two  years  or  less  and  they  usually  open  new  accounts  before  their  current  accounts 
are  identified  as  malicious. 

We use R4B to choose and customize our statechart assertions. R4B has two
different interfaces, one for customization of instances and one for validation. On the
first page, users select rules according to the NL assertions, and can create and edit
instances. On the second page, in order to validate assertions, users upload a
spreadsheet that includes the columns each rule requires. These columns are our beta
columns. We create five rules, which are instances of the R4B rules Rule-1, Rule-3,
Rule-9, Rule-19, and Rule-21, as shown in Table 4. What Table 4 calls a pattern is called
a generic rule in R4B. “P=HiddenState=="M" and friends_count<1000 and
followers_count<20000 and followers_count>500” denotes the event in which the tweet is
marked as malicious, its user has fewer than 1,000 friends, and its number of followers
is between 500 and 20,000. “HiddenState=="B"” and “HiddenState=="S"” indicate tweets
marked as benign and suspicious, respectively. Figures 11 to 15 show the statechart
assertions for the generic rules, called patterns in Table 4.
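The event P quoted above is an ordinary boolean condition on the columns of a tweet. As a minimal sketch, assuming each tweet is a dictionary keyed by column name, the Rule-1 instance can be read as follows; the function names are hypothetical, and R4B evaluates the rule through its own statechart assertion rather than through code like this.

```python
def event_p(tweet):
    """Event P of the Rule-1 instance: malicious, <1000 friends,
    and between 500 and 20000 followers (both bounds exclusive)."""
    return (
        tweet["HiddenState"] == "M"
        and tweet["friends_count"] < 1000
        and 500 < tweet["followers_count"] < 20000
    )


def rule_1_flags(tweets):
    """Return the indices of the tweets that the Rule-1 instance would flag."""
    return [i for i, tweet in enumerate(tweets) if event_p(tweet)]
```

For example, a tweet marked "M" whose user has 828 friends and 1,226 followers satisfies P, whereas the same tweet marked "B" does not.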


Table 4. Instances of R4B Patterns.

Rules4business

Rule-1
  Pattern:      Flag whenever event P happens.
  Events and Limits&Bounds:
                P=HiddenState=="M" and friends_count<1000 and
                followers_count<20000 and followers_count>500
  Description:  Flag the tweet whenever there is a malicious (HiddenState)
                tweet with <1000 friends and >500 followers and <20000
                followers.

Rule-3
  Pattern:      Flag whenever no event Q occurs between P and R.
  Events and Limits&Bounds:
                Q=HiddenState=="B", P=HiddenState=="M",
                R=HiddenState=="M".
  Description:  Flag the tweet whenever there is no benign tweet between
                two malicious tweets.

Rule-9
  Pattern:      Flag whenever some pair of consecutive E events is less
                than time T apart.
  Events and Limits&Bounds:
                E=HiddenState==="S", Time bounds: T=4, Time units: weeks
  Description:  Flag the tweet whenever two users have suspicious
                (HiddenState) tweets and their accounts were created <4
                weeks apart.

Rule-19
  Pattern:      Flag whenever more than N events E within time T after Q.
  Events and Limits&Bounds:
                E=friends_count<1000 and followers_count<20000 and
                followers_count>500, Q=HiddenState=="M"
                N=3, T=10
  Description:  Flag the tweet whenever there are >3 tweets with <1000
                friends, >500 followers, and <20000 followers within 10 days
                after the user of a malicious (HiddenState) tweet was created.

Rule-21
  Pattern:      Flag whenever event Q occurs >N times between some
                pair of consecutive E events.
  Events and Limits&Bounds:
                E=text.indexOf(“http”)>=0, Q=HiddenState===“S”
                N=2 (count bounds)
  Description:  Flag when there are >2 suspicious tweets between any two
                tweets that include an http link.
[Screenshot of the R4B page for Rule 1, "Flag whenever event P happens", listing one
instance. Instance ID: 41320161948.00368. Events: P=HiddenState=="M" and
friends_count<1000 and followers_count<20000 and followers_count>500. Description:
Flag the tweet whenever there is a malicious (HiddenState) tweet with less than 1000
friends and more than 500 followers and less than 20000. Silent: false.]

Figure 11. Instance of Rule-1 from R4B.


[Screenshot of the R4B page for Rule 3, "Flag whenever no event Q occurs between P and
R", listing one instance. Instance ID: 23520162258.972187. Events:
Q=HiddenState=="B", P=HiddenState=="M", R=HiddenState=="M".]

Figure 12. Instance of Rule-3 from R4B.
[Screenshot of the R4B page for Rule 9, "Flag whenever some pair of consecutive E events
is less than time T apart", listing one instance. Instance ID: 8352016234.931245.
Events: E=HiddenState==="S". Time bounds: T=4. Time units: weeks. Description: Flag
the tweet whenever two users have suspicious (HiddenState) tweets and are created less
than 4 weeks apart. Silent: false.]

Figure 13. Instance of Rule-9 from R4B.


[Screenshot of the R4B page for Rule 19, "Flag whenever more than N events E within
time T after Q", listing one instance. Instance ID: 183520162325.89062. Events:
E=HiddenState==="M" and friends_count<1000 and followers_count<20000 and
followers_count>500, Q=HiddenState==="S". Count limits: N=3. Time bounds: T=10. Time
units: days. Silent: false. The description column is cut off in the screenshot.]

Figure 14. Instance of Rule-19 from R4B.
[Screenshot of the R4B page for Rule 21, "Flag whenever event Q occurs more than N
times between some pair of consecutive E events", listing one instance. Instance ID:
203520162342.92953. Events: Q=text.indexOf("http")>=0, E=HiddenState==="S". Count
limits: N=2. The description column is cut off in the screenshot.]

Figure 15. Instance of Rule-21 from R4B.


D.  VALIDATION  OF  ASSERTIONS  IN  RULES4BUSINESS 

We validated our assertions by uploading a file called the “validation spreadsheet.”
Before uploading this spreadsheet in csv or xlsx format (Figure 16), we specified the
column indexes as follows: “user_created_at=1, text=2, followers_count=3,
friends_count=4, HiddenState=5, time=1.” These indexes follow the column order of our
spreadsheet: account creation date, tweet text, number of followers, number of friends,
and state of the tweet. For example, the number of followers is in the third column.
The time column provides the baseline (x-axis) of the flag timeline diagram; here it
points to the same column as user_created_at.
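The column specification can be read as a mapping from column names to 1-based column
positions. The following is a minimal Python sketch of that reading, assuming a
comma-separated list of name=index pairs. The helper names parse_column_spec and pick
are hypothetical and are not part of R4B.

```python
def parse_column_spec(spec):
    """Turn 'a=1, b=2' into {'a': 1, 'b': 2} (1-based column positions)."""
    mapping = {}
    for item in spec.split(","):
        name, index = item.strip().split("=")
        mapping[name.strip()] = int(index)
    return mapping


def pick(row, mapping, name):
    """Return the cell of `row` (a list of cells) that `name` refers to."""
    return row[mapping[name] - 1]


SPEC = ("user_created_at=1, text=2, followers_count=3, "
        "friends_count=4, HiddenState=5, time=1")
COLUMNS = parse_column_spec(SPEC)

# Row 6 of the first validation spreadsheet (Figure 17).
row6 = ["2010-05-18", "@The__Gatorr get on damn lol", "695", "2", "M"]
print(pick(row6, COLUMNS, "followers_count"))  # cell in the third column
print(pick(row6, COLUMNS, "time"))             # same cell as user_created_at
```

Because time and user_created_at share index 1, the account creation date doubles as the
time axis of the timeline diagrams.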


[Screenshot of the R4B upload page for the file 0_denemeR4B.xlsx, showing the message
"Finished executing assertions using this data. You can now visualize rule behavior"
and the column specification ending in "HiddenState=5, time=1".]

Figure 16. Naming the Columns in R4B.
In the validation phase, we use two versions of the validation spreadsheet, shown in
Figures 17 and 18. The validation spreadsheet consists of all columns used in the R4B
site and the HiddenState column. The two versions contain different values so that they
induce different flags. The highlighted cells in Figure 18 show the differences between
the two spreadsheets. Thus we can check the expected results as explained below.
Figures 17 and 18 depict the first 12 rows of the validation spreadsheet.


user_created_at | text                                      | followers_count | friends_count | HiddenState
2009-03-13      | Super Lol https:\/\/t.co\/lrj3JaAEjn      | 2946            | 1506          | S
2009-04-18      | @BluthX @joanwalsh How is stating fa      | 219             | 740           | S
2009-06-03      | 23. I use to be the captain of SCTCC LYM  | 204             | 149           | S
2010-01-14      | The seats are being filled ahead of the   | 265880          | 420           | B
2010-05-15      | @staceymurdoughl thanks chick! I'll ta    | 1226            | 828           | S
2010-05-18      | @The__Gatorr get on damn lol              | 695             | 2             | M
2010-07-12      | @TT_Sisters @LittleMeThatter @GaryB       | 204             | 149           | S
2010-07-13      | @Theominiking \nYou was absolutely        | 580             | 1             | M
2010-07-19      | talaye!http                               | 935             | 24            | M
2010-07-19      | Most amazing moment of 2016. Discuss      | 935             | 736           | M
2010-07-22      | @TRobinsonNewEra @lynbrownmp @            | 543             | 312           | S
2010-07-22      | You can stop it. Yes . Can stop it .      | 326             | 2             | S

Figure 17. R4B Validation Spreadsheet Version 1


user_created_at | text                                        | followers_count | friends_count | HiddenState
2009-03-13      | Super Lol https:\/\/t.co\/lrj3JaAEjn        | 2946            | 1506          | M
2009-04-18      | @BluthX @joanwalsh http How is stating      | 219             | 740           | S
2009-06-03      | 23. I use to be the captain of SCTCC LYM C  | 204             | 149           | S
2010-01-14      | The seats are being filled ahead of the st  | 265880          | 420           | S
2010-05-15      | @staceymurdoughl thanks chick! I'll take p  | 1226            | 828           | S
2010-05-18      | @The__Gatorr get http on damn lol           | 695             | 2             | S
2010-07-12      | @TT_Sisters @LittleMeThatter @GaryBarl      | 204             | 149           | M
2010-07-13      | @Theominiking \nYou was absolutely am       | 580             | 1             | M
2010-07-19      | talaye!http                                 | 935             | 24            | M
2010-07-19      | Most amazing moment of 2016. Discussing     | 935             | 736           | M
2010-07-22      | @TRobinsonNewEra @lynbrownmp @Grc           | 543             | 312           | S
2010-07-22      | You can stop it. Yes . Can stop it .        | 326             | 2             | S

Figure 18. R4B Validation Spreadsheet Version 2


According to the first validation spreadsheet, we expect Rule-1 to induce an RM
flag in rows 6, 8, 9, and 10. For Rule-3, rows 8 and 10 are expected to induce an RM
flag. For Rule-9, a single RM flag is expected in row 11, and we do not expect any
flags for Rule-19 and Rule-21.
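The expected flags for Rule-1, Rule-3, and Rule-9 can be reproduced with a short Python
sketch over the 12 rows of Figure 17. This is only an illustration of the patterns in
Table 4, not the R4B implementation. In particular, it assumes that a rule restarts from
its initial state after raising a flag and that a benign tweet cancels a pending P in
Rule-3; these assumptions were chosen because they reproduce the expected rows. Rule-19
and Rule-21 are left out because their exact timing and counting semantics are not
spelled out here.

```python
from datetime import date, timedelta

# (user_created_at, followers_count, friends_count, HiddenState), rows 1-12 of Figure 17.
ROWS_V1 = [
    ("2009-03-13", 2946, 1506, "S"),
    ("2009-04-18", 219, 740, "S"),
    ("2009-06-03", 204, 149, "S"),
    ("2010-01-14", 265880, 420, "B"),
    ("2010-05-15", 1226, 828, "S"),
    ("2010-05-18", 695, 2, "M"),
    ("2010-07-12", 204, 149, "S"),
    ("2010-07-13", 580, 1, "M"),
    ("2010-07-19", 935, 24, "M"),
    ("2010-07-19", 935, 736, "M"),
    ("2010-07-22", 543, 312, "S"),
    ("2010-07-22", 326, 2, "S"),
]


def rule1(rows):
    """Rule-1: flag every row where event P holds."""
    return [i for i, (_, followers, friends, state) in enumerate(rows, start=1)
            if state == "M" and friends < 1000 and 500 < followers < 20000]


def rule3(rows):
    """Rule-3: flag R (malicious) when no Q (benign) occurred since P (malicious)."""
    flags, armed = [], False
    for i, (_, _, _, state) in enumerate(rows, start=1):
        if state == "M":
            if armed:
                flags.append(i)
                armed = False      # assumption: restart after a flag
            else:
                armed = True       # this malicious row plays the role of P
        elif state == "B":
            armed = False          # assumption: Q cancels the pending P
    return flags


def rule9(rows, limit=timedelta(weeks=4)):
    """Rule-9: flag an E (suspicious) row created less than `limit` after the previous E."""
    flags, previous = [], None
    for i, (created, _, _, state) in enumerate(rows, start=1):
        if state != "S":
            continue
        when = date.fromisoformat(created)
        if previous is not None and when - previous < limit:
            flags.append(i)
            previous = None        # assumption: restart after a flag
        else:
            previous = when
    return flags


print(rule1(ROWS_V1))  # [6, 8, 9, 10]
print(rule3(ROWS_V1))  # [8, 10]
print(rule9(ROWS_V1))  # [11]
```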
We perturbed some values in the first validation spreadsheet (Figure 17) to
create a second validation spreadsheet (Figure 18). For example, in row 6, we changed
the HiddenState value from malicious (M) to suspicious (S). The user in this row has
fewer than 1000 friends and a number of followers between 500 and 20,000, so the row
induced an RM flag for Rule-1 in the first spreadsheet. After changing the HiddenState
value to suspicious, we expect no RM flag, and indeed none was induced, as depicted in
Figure 19. Likewise, in the first spreadsheet Rule-9 is not expected to induce an RM
flag in row 6 because it was not a suspicious tweet, and indeed it did not. After we
changed the HiddenState value to suspicious, Rule-9 is expected to induce an RM flag
because the tweets in rows 5 and 6 were created three days apart and both are
classified as suspicious; indeed, such a flag was induced, as depicted in Figure 20.


This  is  the  essence  of  the  validation  phase:  to  check  that  all  rules  induce  an  RM 
flag  precisely  when  expected  to  do  so. 
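The effect of the perturbation on row 6 can be checked with a small Python sketch that
compares the HiddenState columns of Figures 17 and 18. The numeric columns are the same
in both figures, so only the state strings differ here. The helper names are
hypothetical, and the Rule-1 predicate is taken from Table 4.

```python
from datetime import date

# (user_created_at, followers_count, friends_count), rows 1-12; same in both versions.
BASE = [
    ("2009-03-13", 2946, 1506), ("2009-04-18", 219, 740),
    ("2009-06-03", 204, 149), ("2010-01-14", 265880, 420),
    ("2010-05-15", 1226, 828), ("2010-05-18", 695, 2),
    ("2010-07-12", 204, 149), ("2010-07-13", 580, 1),
    ("2010-07-19", 935, 24), ("2010-07-19", 935, 736),
    ("2010-07-22", 543, 312), ("2010-07-22", 326, 2),
]
STATES_V1 = "SSSBSMSMMMSS"   # HiddenState column of Figure 17
STATES_V2 = "MSSSSSMMMMSS"   # HiddenState column of Figure 18


def rule1_flags(states):
    """Rows where P of Rule-1 holds for the given HiddenState column."""
    return [i for i, ((_, followers, friends), s)
            in enumerate(zip(BASE, states), start=1)
            if s == "M" and friends < 1000 and 500 < followers < 20000]


def days_between(row_a, row_b):
    """Days between the creation dates of two 1-based rows."""
    a = date.fromisoformat(BASE[row_a - 1][0])
    b = date.fromisoformat(BASE[row_b - 1][0])
    return (b - a).days


changed = [i for i, (a, b) in enumerate(zip(STATES_V1, STATES_V2), start=1) if a != b]
print(changed)                  # rows whose HiddenState was perturbed
print(rule1_flags(STATES_V1))   # includes row 6
print(rule1_flags(STATES_V2))   # row 6 is no longer flagged
print(days_between(5, 6))       # 3 days, well under the 4-week bound of Rule-9
```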


[Rule-1 flag timeline diagrams from R4B, with row numbers and dates on the time axis
and a Flag marker at each flagged row. Annotation: "There is no flag for row 6 since we
changed the HiddenState value from M to S." Event assignments: P=HiddenState=="M" and
friends_count<1000 and followers_count<20000 and followers_count>500.]

Figure 19. Rule-1 Flag Timeline Diagram
[Rule-9 flag timeline diagrams from R4B for the two validation spreadsheets. The time
axis is labeled with Row No., Cycle, and Date; the markers are E, timeoutFire, and Flag.
Annotation on the second diagram: "When we changed the HiddenState value to suspicious,
this row flagged since it is just 3 days apart from row 5 which is also suspicious."
Event assignments: E=HiddenState==="S". Timing bounds: T=4 weeks. Description: Flag the
tweet whenever two users have suspicious (HiddenState) tweets and are created less than
4 weeks apart.]

Figure 20. Rule-9 Flag Timeline Diagram


R4B visualizes the behavior of each rule. Figure 21 presents this visualization for
Rule-9. In Figure 21, the upper left window shows the statechart diagram. The lower
left window presents the uploaded file with the flagged tweet in row 11; flagged tweets
are displayed in red. The right window is the timeline diagram showing all flag and
non-flag states along the time axis. The tweet in row 11 is flagged since the tweets in
rows 7 and 11 are both suspicious and were created less than four weeks apart.
[Screenshot of three R4B browser windows for Rule-9 (rule instance ID
8352016234.931245): the statechart diagram, the loaded logfile data table, and the live
timeline diagram for "Flag whenever some pair of consecutive E events is less than time
T apart". The fired data-source event is E=HiddenState==="S", with transaction date
07/22/2010 and data-source row 11.]

Figure 21. Success Flag for Rule 9 in R4B


E.  STANDARD STATEROVER RULE CREATION AND CODE GENERATION

The StateRover detects behavioral patterns by using deterministic UML-based
statechart patterns. It extends the statechart-based notation by combining statechart
diagrams, the Java action language, and the built-in Boolean flag bSuccess.

We refer to the StateRover in this thesis because it is part of the code generation
process: the code generator described in Section H works on code generated by the
StateRover. Therefore, we converted our R4B diagrams to StateRover diagrams. Readers
who do not need the details of the StateRover may skip to the next section.

In this phase, for each R4B rule, we created a corresponding StateRover
statechart assertion (Figure 22). Statechart assertions start with an initial state.
Events trigger the transitions between states. The final state is the Flag state, which
shows whether the assertion fails or succeeds. If the StateRover reaches the Flag state,
it assigns a false value to the bSuccess Boolean variable, meaning that the assertion
discovered a flagged scenario.


Figure 22. Rule-19 Statechart Diagram in StateRover
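To make the role of the Flag state and bSuccess concrete, the following is a rough
Python analogue of a Rule-19 style assertion. It is not StateRover's generated Java
code. The class name, the method names incr_time, on_q, and on_e, the state names, and
the way the time window is handled are all assumptions made for illustration.

```python
class Rule19Sketch:
    """Toy statechart assertion: Init -> Counting -> Flag."""

    def __init__(self, n=3, t=10):
        self.n = n                 # count bound N
        self.t = t                 # time bound T, in abstract time units
        self.state = "Init"
        self.b_success = True      # mirrors bSuccess: False once Flag is reached
        self.now = 0
        self.window_start = None
        self.count = 0

    def incr_time(self, units):
        self.now += units

    def on_q(self):
        # Q opens the observation window (only from the initial state).
        if self.state == "Init":
            self.state = "Counting"
            self.window_start = self.now
            self.count = 0

    def on_e(self):
        if self.state != "Counting":
            return
        if self.now - self.window_start > self.t:
            self.state = "Init"    # window expired without a flag
            return
        self.count += 1
        if self.count > self.n:
            self.state = "Flag"
            self.b_success = False


def run(e_events):
    """Fire Q once, then `e_events` E events one time unit apart."""
    rule = Rule19Sketch(n=3, t=10)
    rule.incr_time(1)
    rule.on_q()
    for _ in range(e_events):
        rule.incr_time(1)
        rule.on_e()
    return rule.b_success


print(run(3))  # True: three E events do not exceed N=3
print(run(4))  # False: the fourth E event within T reaches Flag
```

As in the text, a false value of the flag means that the flagged scenario was observed,
not that the monitor itself malfunctioned.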


In the StateRover, automatic code generation requires that only two arguments be
created. The first one is an events.properties file for each rule (for example,
Rule19_events.properties). These files are simple text files; Figure 23 shows an
example. They contain text that we already used in the R4B phase (see the Events and
Limits&Bounds sections in Table 4).


[Screenshot of the Eclipse editor showing Rules/src/Rule19/Rule19_events.properties.
The file contents are:]

    #Rule19_Events.properties
    E=followers_count>500 and followers_count<20000 and friends_count<1000
    Q=HiddenState=="M"
    T=10 days
    N=3

Figure 23. Rule19_events.properties File
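Since the events.properties file is a plain list of KEY=value lines, it can be read with
a few lines of Python. The sketch below is only an illustration of the file layout shown
in Figure 23. The function name parse_events_properties is hypothetical, and the
StateRover's own reader may treat these files differently.

```python
def parse_events_properties(text):
    """Parse 'KEY=value' lines; lines starting with '#' are comments.

    Each line is split on its first '=' only, so event expressions that
    contain '==' stay intact.
    """
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        entries[key.strip()] = value.strip()
    return entries


SAMPLE = """\
#Rule19_Events.properties
E=followers_count>500 and followers_count<20000 and friends_count<1000
Q=HiddenState=="M"
T=10 days
N=3
"""

props = parse_events_properties(SAMPLE)
print(props["Q"])        # HiddenState=="M"
print(int(props["N"]))   # 3
```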
The StateRover implements a two-step process to validate that the R4B diagrams are
drawn accurately in the StateRover. In the first step, it generates Java code when we
save our statechart diagrams. In the second step, we execute JUnit tests to ensure that
each StateRover statechart assertion behaves the same as its R4B counterpart. All JUnit
sanity test code is in Appendix A. Figure 24 shows that all sanity tests run correctly.


[Screenshot of Eclipse showing Rules/src/Rule19/SanityTest.java next to the JUnit view.
JUnit reports "Finished after 0.154 seconds" and "Runs: 10/10"; the SanityTest classes
of Rule21, Rule1, Rule19, Rule3, and Rule9 each run two tests, test and test1. The
visible part of SanityTest.java, which is cut off in the screenshot, reads:]

    package Rule19;

    //import static org.junit.Assert.*;
    import junit.framework.TestCase;

    public class SanityTest {

        Rule19 rule;

        @Before
        public void setup(){
            rule = new Rule19();
            rule.N=3;
            rule.T=10;
            rule.execTRreset(); // enforce the setting of T
        }

        @Test
        public void test() {
            rule.incrTime(1);
            rule.Q();
            rule.incrTime(1);
            rule.E();
            rule.incrTime(1);
            rule.E();
            rule.E();
            rule.E();
            rule.incrTime(8);
            rule.timeoutFire();

Figure 24. Sanity Tests Run


F.  LEARNING  PHASE  FOR  HMM 

In the learning phase, we need to add a HiddenState column to our spreadsheet as
column 7. Each tweet is classified into one of three categories: malicious (M),
suspicious (S), or benign (B). To populate the learning-phase spreadsheet, we act as a
tweet classification expert. Table 5 defines the rule used for specifying the values of
HiddenState. To determine these values, a human operator uses the followers_count,
friends_count, user_verified, and geo_enabled columns. Figure 25 shows a snippet of our
learning-phase spreadsheet. HMM creation and the learning algorithm are explained in
Section G.


Table 5. The Rule to Determine HiddenState in Learning Phase

Columns         | Observation                                          | Action
followers_count | If the number of followers is between 500 and 20000  | add 1 to total
friends_count   | If the number of friends is <1000                    | add 1 to total
user_verified   | If the user is not verified                          | add 1 to total
geo_enabled     | If the account does not enable geolocation           | add 1 to total

If the total is
    4   : assign M (malicious)
    2-3 : assign S (suspicious)
    0-1 : assign B (benign)

followers_count | friends_count | user_verified | geo_enabled | HiddenState
2946            | 1506          | FALSE         | TRUE        | S
219             | 740           | FALSE         | TRUE        | S
204             | 149           | FALSE         | TRUE        | S
265880          | 420           | TRUE          | TRUE        | B
1226            | 828           | FALSE         | TRUE        | S
695             | 2             | FALSE         | FALSE       | M
204             | 149           | FALSE         | FALSE       | S
580             | 1             | FALSE         | FALSE       | M
935             | 24            | FALSE         | FALSE       | M

Figure 25. The Population of the HiddenState Column
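The scoring rule of Table 5 can be written as a small Python function and checked
against the rows of Figure 25. This is a sketch of the manual rule, not code from the
thesis. The function name hidden_state is hypothetical, and "between 500 and 20000" is
read here as strictly greater than 500 and strictly less than 20000, matching the
inequalities used in Table 4.

```python
def hidden_state(followers_count, friends_count, user_verified, geo_enabled):
    """Score a tweet's account with the four checks of Table 5."""
    total = 0
    if 500 < followers_count < 20000:   # assumed strict bounds, as in Table 4
        total += 1
    if friends_count < 1000:
        total += 1
    if not user_verified:
        total += 1
    if not geo_enabled:
        total += 1
    if total == 4:
        return "M"   # malicious
    if total >= 2:
        return "S"   # suspicious
    return "B"       # benign


# (followers_count, friends_count, user_verified, geo_enabled, HiddenState) from Figure 25.
FIGURE_25 = [
    (2946, 1506, False, True, "S"),
    (219, 740, False, True, "S"),
    (204, 149, False, True, "S"),
    (265880, 420, True, True, "B"),
    (1226, 828, False, True, "S"),
    (695, 2, False, False, "M"),
    (204, 149, False, False, "S"),
    (580, 1, False, False, "M"),
    (935, 24, False, False, "M"),
]

for followers, friends, verified, geo, expected in FIGURE_25:
    assert hidden_state(followers, friends, verified, geo) == expected
print("all rows of Figure 25 match the rule of Table 5")
```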


G.  GENERATING  THE  HIDDEN  MARKOV  MODEL  (HMM) 

An HMM is a statistical model in which the state is not fully visible. However,
the observable, which depends on the state, is visible. The model determines one of the
possible outputs by looking at the sequence of observables [19]. It provides a way to
capture patterns that are essential for making more accurate decisions.
In our thesis, the hidden states are the categories of tweets. We already
determined our hidden column according to the rule in the learning phase (Table 5). The
spreadsheet that includes the populated hidden column and the other visible data used
in the R4B rule instances is our learning-phase spreadsheet (csv file). The
learning-phase csv also contains an indication of the initial state. The learning-phase
spreadsheet separates data into two types: visible data (friends_count,
followers_count, text, and user_created_at) and hidden data (HiddenState). Let v, h,
and N represent visible data, hidden data, and total number of rows, respectively.
Also, let h_i and v_i be the values of the hidden and visible columns in row i. As
Drusinsky states in [23], our HMM learns the probabilities as shown below:

•  The probability of a transition between states is obtained by dividing the
number of occurrences of that specific transition by N-1 (the total number
of transitions in the spreadsheet). For example, suppose we have 40
transitions from suspicious (S) to malicious (M) in h and N is 81. Then the
HMM transition probability for S->M is #(S->M)/(N-1) = 40/80 = 0.5.

•  Suppose that for a given row i, v_i is k and h_i is M. The probability of an
observable k being emitted in hidden state M is calculated by dividing
the number of rows that satisfy both k and M by N.

•  The initial state probability distribution is the proportion of times each
hidden state is marked as an initial state. For instance, if two of the three
hidden states are marked as initial states equally often, then the initial
state probability distribution is [0.5, 0.5, 0].
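The three counting rules can be sketched in Python as follows. This follows the
normalizations exactly as stated above (by N-1 for transitions and by N for emissions),
which yield joint frequencies over the whole spreadsheet rather than the per-state
conditional probabilities of the textbook HMM parameterization. The function name
learn_hmm, the row layout, and the toy data are assumptions for illustration and are
not the generator described in [23].

```python
from collections import Counter


def learn_hmm(rows):
    """rows: list of (is_initial, observable, hidden_state) in spreadsheet order."""
    n = len(rows)
    hidden = [h for _, _, h in rows]

    # Transitions between consecutive rows, divided by N-1.
    transitions = Counter(zip(hidden, hidden[1:]))
    transition_prob = {pair: c / (n - 1) for pair, c in transitions.items()}

    # Rows that satisfy both observable k and hidden state h, divided by N.
    emissions = Counter((h, k) for _, k, h in rows)
    emission_prob = {pair: c / n for pair, c in emissions.items()}

    # Share of each hidden state among the rows marked as initial.
    initial = Counter(h for is_initial, _, h in rows if is_initial)
    marked = sum(initial.values())
    initial_prob = {h: c / marked for h, c in initial.items()} if marked else {}

    return transition_prob, emission_prob, initial_prob


# Toy data: five rows, two of them marked as initial.
TOY = [
    (True, "k1", "S"),
    (False, "k2", "M"),
    (False, "k1", "S"),
    (False, "k2", "M"),
    (True, "k1", "M"),
]
trans, emit, init = learn_hmm(TOY)
print(trans[("S", "M")])   # 2 of the 4 transitions
print(emit[("M", "k2")])   # 2 of the 5 rows
print(init)                # S and M each marked once
```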

R4B supports events such as followers_count<500, which range over an infinite set
of possible observables, whereas a typical HMM operates on a relatively small number of
observables. Therefore, we need to quantize our values into corresponding value ranges
such as followers_countLT500 (number of followers is less than 500). Table 6 indicates
how the columns user_created_at, friends_count, and followers_count are quantized. The
Python code for quantization is presented in Appendix B. Quantization enables our
toolset to map generic names like P, Q, and R.

The HMM generator needs a column named HiddenState. As we showed in
Section F, we played the role of an expert and filled in the cells of the HiddenState
column. A snapshot of the final learning-phase csv file is shown in Figure 26.
Table 6. Quantization of Columns.

Columns         | Values                    | Description
user_created_at | new                       | If the account is created more than two years
                | old                       | ago, this account is OLD. Otherwise it is NEW.
friends_count   | friends_countLT500        |
                | friends_count500to1000    |
                | friends_countGT1000       |
followers_count | followers_countLT500      |
                | followers_count500to20000 |
                | followers_countGT20000    |
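The quantization in Table 6 can be sketched in Python as below. This is not the code of
Appendix B. The function names, the reference date passed as today, and the choice to
put the boundary values (500, 1000, 20000) into the middle range are assumptions. The
order of the resulting triple follows the compound outputs visible in Figure 27.

```python
from datetime import date


def quantize_created_at(created, today):
    """OLD if the account was created more than two years before `today`."""
    try:
        cutoff = today.replace(year=today.year - 2)
    except ValueError:  # `today` is 29 February and the cutoff year is not a leap year
        cutoff = today.replace(year=today.year - 2, day=28)
    return "OLD" if created < cutoff else "NEW"


def quantize_count(column, value, low, high):
    """Map a count to <column>LT<low>, <column><low>to<high>, or <column>GT<high>."""
    if value < low:
        return f"{column}LT{low}"
    if value <= high:              # assumption: boundary values fall in the middle range
        return f"{column}{low}to{high}"
    return f"{column}GT{high}"


def quantize_row(created, followers, friends, today):
    return (
        quantize_created_at(created, today),
        quantize_count("followers_count", followers, 500, 20000),
        quantize_count("friends_count", friends, 500, 1000),
    )


# Row with creation date 2010-05-18, 695 followers, and 2 friends;
# the reference date is an arbitrary choice for this example.
print(quantize_row(date(2010, 5, 18), 695, 2, today=date(2016, 6, 1)))
# ('OLD', 'followers_count500to20000', 'friends_countLT500')
```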

[Screenshot of the learning-phase file denemeHMM_IS_HS.csv in a text editor. Long lines
are cut off at the right edge of the window. The first column of each row is Y, which is
just a special column indicating the initial state.]

    InitialState,user_created_at,text,followers_count,friends_count,HiddenState
    Y,2009-03-13,Super Lol https:\/\/t.co\/lrj3JaAEjn,2946,1506,S
    Y,2009-04-18,@BluthX @joanwalsh How is stating fact an opinion,219,740,S
    Y,2009-06-03,23. I use to be the captain of SCTCC LYM Crew #reasonsforsteeltoeguest,204,149,S
    Y,2010-01-14,The seats are being filled ahead of the start of the 2016 #MOGOAwards at the Alisa h
    Y,2010-05-15,@staceymurdoughl thanks chick! I'll take pics X,1226,828,S
    Y,2010-05-18,@The__Gatorr get on damn lol,695,2,M
    Y,2010-07-12,@TT_Sisters @LittleMeThatter @GaryBarlow Very and the video is even cuter x,204,149,
    Y,2010-07-13,@Theominiking \nYou was absolutely amazing on Theovision \ud83e\udd17\ud83e\udd17\uc
    Y,2010-07-19,talaye!http,935,24,M
    Y,2010-07-19,Most amazing moment of 2016. Discussing my dissertation with Fatou Bensouda,935,736,
    Y,2010-07-22,@TRobinsonNewEra @lynbrownmp @Gropeapanda @WestHamLabour This is an Islamic sign an
    Y,2010-07-22,You can stop it. Yes . Can stop it .,326,2,S
    Y,2011-02-09,To Six Flags we go!http,927,22,S
    Y,2011-06-18,@Patsydogl And no one can beleeb I is 10.,695,2,M
    Y,2011-08-06,Some people can vote dead Body! Aiyavedied #AMVCA2016,267,246,S
    Y,2011-08-22,2-1 Iggy on shorthanded goal,962,7,M
    Y,2011-08-22,@oufcoli @BurdonGeorge @OUFC_ no jobs !? \ud83d\ude02I'm a copper . Do believe I was
    Y,2011-10-19,Need to be swooped to the Coliseum,1029,841,S
    Y,2012-03-02,"V"*if you look at this painting and all you can see are naked bodies ... the proble
    Y,2012-07-18,all this chisme gots me like https:\/\/t.co\/wrmK3VBdbK,695,1019,S
    Y,2012-09-06,\ud83d\udc97\n\n\ud83d\udc97\n\n\ud83d\udc97\n\n\ud83d\udc97\nI'VE NEVER GOTTEN A CA
    Y,2012-09-22,@BleuSergy Would be interesting to visualise.,1581,273,M
    Y,2012-09-22,Heading to top 3rd,326,246,S
    Y,2012-11-12,@carriekovarik happy share some screen grabs of @MyDermPortal for your next presenta
    Y,2012-12-15,It's just a party in Pittsburgh today isn't it,587,716,S
    Y,2013-01-04,@blackzeusx and I pay tribute to the king. #Brooklyn #nocorious #hiphop #music #trit
    Y,2013-01-16,"GHeLilceGoodMusic thanks for the follow Go check out our single \""Keep It On The Lc

Figure 26. Learning-Phase CSV File.


The last task in this step is to run a command that generates an hmm.json file,
which contains the HMM in JSON (JavaScript Object Notation) format (Figure 27).
[Screenshot of the Windows command prompt ("Komut İstemi") after the run. The output
consists of one line per learning-phase row, of the form "For state s seeing
CompoundOutput <OLD,followers_countLT500,friends_countLT500>", where the state is the
row's hidden state and the triple holds the quantized values of user_created_at,
followers_count, and friends_count. The last line reports that the output file is stored
in C:\users\zeyzeyia\Google Drive\new_begin\denemeler\hmm.json.]

Figure 27. Command Prompt Run for Quantization


H.  GENERATING  SPECIAL  JAVA  CODE  FOR  PROBABILISTIC 
RUNTIME  MONITORING 

In this step, we create a new Java project containing the automatically generated Java code and its sanity tests (Figure 28). The sanity tests validate that the generated Java code runs correctly. Appendix C shows the sanity tests of the Rule1_DTRA, Rule3_DTRA, Rule9_DTRA, Rule19_DTRA, and Rule21_DTRA Java files.

Eclipse Package Explorer view of the project:

DTRA_Rules
    src
        com.timerover.staterover.ifacesrc
        Rule1
            Rule1_DTRA.java
            SanityTest.java
        Rule19
            Rule19_DTRA.java
            SanityTest.java
        Rule21
            Rule21_DTRA.java
            SanityTest.java
        Rule3
            Rule3_DTRA.java
            SanityTest.java
        Rule9
            Rule9_DTRA.java
            SanityTest.java
    JRE System Library [JavaSE-1.8]
    JUnit 4
    Referenced Libraries


Figure  28.  DTRA_Rules. 


The special Java code for probabilistic RM implements an algorithm described in [23]. This algorithm uses an input sequence in the form of a list of two-tuples, such as Input = {K1,P1}, {K2,P2}, {K3,P3}, ..., {Kn,Pn}. Ki is either a visible event (i.e., the friends_count and followers_count columns) or a hidden one (i.e., the HiddenState column). Pi is the probability of distribution (POD) of Ki. The POD of a visible event is 1; the POD of a hidden event is taken from the results of running the alpha method on the HMM (the Alpha Method in Section I).

As pointed out in [23], the runtime evaluation of an assertion consists of a collection of objects called configurations. We label a collection as Col and a configuration as Conf. Each Conf has a present state PS(Conf) and a probability value P(Conf), the probability of the assertion being in configuration Conf. At start-up, there is a single configuration Conf with P(Conf)=1. Given an event Ki whose Pi is less than 1 (i.e., Ki is hidden), Conf responds to the pair {Ki, Pi} by splitting into two configurations, Conf1 and Conf2. The probabilities and states of Conf1 and Conf2 are calculated as follows:

P(Conf1) = P(Conf) * Pi and P(Conf2) = 1 - P(Conf1)

PS(Conf1) is the next state determined by the transition that is taken if the event fired.

Otherwise, the event did not fire and PS(Conf2) is assigned PS(Conf).

Note that two or more configurations that share the same present state are combined into one configuration, Conf_combined, by summing the P(Conf) probabilities of all participating configurations.

A statechart assertion declares the probability of a violation of its corresponding requirement, also known as the probability of failure (POF) [23], as the sum of P(Conf) over all Conf that reach the StateRover error (R4B flag) state.
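
The configuration bookkeeping described above can be illustrated with a minimal Python sketch. It is not the Java code generated by StateRover: the toy transition table, the state names, and the function names are hypothetical. The sketch also assumes that the non-firing branch keeps the remaining mass P(Conf) * (1 - Pi), which coincides with 1 - P(Conf1) at start-up, where P(Conf)=1; this generalization to several live configurations is our assumption and is not taken from [23].

```python
from collections import defaultdict


def step(configs, transitions, event, p_event):
    """Advance every configuration by one {event, probability} pair.

    configs:     dict mapping present state PS(Conf) -> P(Conf)
    transitions: dict mapping (state, event) -> next state (toy monitor, hypothetical)
    """
    merged = defaultdict(float)
    for state, p_conf in configs.items():
        fired_state = transitions.get((state, event), state)
        # Conf1: the event fired, so the transition is taken.
        merged[fired_state] += p_conf * p_event
        # Conf2: the event did not fire, so the present state is kept (assumed mass).
        merged[state] += p_conf * (1.0 - p_event)
    # Configurations sharing a present state were merged by summation above.
    return {s: p for s, p in merged.items() if p > 0.0}


def probability_of_failure(configs, flag_state="Flag"):
    """POF: total probability of the configurations in the flag state."""
    return configs.get(flag_state, 0.0)


# Hypothetical two-step monitor: P moves Init -> SawP, R moves SawP -> Flag.
transitions = {("Init", "P"): "SawP", ("SawP", "R"): "Flag"}
configs = {"Init": 1.0}
for event, p in [("P", 1.0), ("R", 0.25)]:  # visible event, then hidden event
    configs = step(configs, transitions, event, p)

print(configs)
print(probability_of_failure(configs))
```

In this run the visible event P moves all probability mass to SawP, and the hidden event R with POD 0.25 splits that configuration, so a quarter of the mass reaches the flag state.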

I.  RUNTIME  MONITORING 

1.  The  Alpha  Method 

A critical part of the novel RM process used in this thesis is the execution of the HMM alpha method, described below. The outcome of this step is a file called alpha.json (Figure 29).

Figure 29. Generating alpha.json File.


According to [23], the alpha method calculates the HMM's POD over time, given the input csv file (Figure 30). More specifically, for every time slot t (i.e., for every row of the csv file), each HMM state s is assigned an alpha value α_s(t), the probability of the HMM being in state s at time t.

[Figure 30. Notepad++ view of the runtime input file C:\Users\zeyzeyim\Google Drive\new_begin\denemeler\0_denemeHMM_NO_IS_HS.csv (56 lines). The header row is user_created_at, text, followers_count, friends_count; each following row holds one tweet, for example: 2009-03-13, Super Lol https:\/\/t.co\/lrj3JaAEjn, 2946, 1506]
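
For illustration, the alpha values can be computed with the standard HMM forward recursion, normalized at every time slot so that the values for one row sum to 1. The sketch below is not the dtraalpha.jar implementation; the two hidden states, the observation symbols, and all probabilities are hypothetical numbers chosen only to make the example runnable.

```python
def alpha_method(initial, transition, emission, observations):
    """Return one dict per time slot mapping state -> normalized alpha value."""
    states = list(initial)
    alphas = []
    prev = None
    for obs in observations:
        if prev is None:
            # First row: prior probability times emission probability.
            raw = {s: initial[s] * emission[s][obs] for s in states}
        else:
            # Later rows: propagate the previous alphas through the transitions.
            raw = {
                s: emission[s][obs] * sum(prev[r] * transition[r][s] for r in states)
                for s in states
            }
        total = sum(raw.values())
        if total == 0.0:
            raise ValueError("observation has zero probability under the model")
        prev = {s: raw[s] / total for s in states}
        alphas.append(prev)
    return alphas


# Hypothetical model with two hidden states and two quantized outputs.
initial = {"normal": 0.5, "suspicious": 0.5}
transition = {
    "normal": {"normal": 0.9, "suspicious": 0.1},
    "suspicious": {"normal": 0.2, "suspicious": 0.8},
}
emission = {
    "normal": {"friends_countLT500": 0.7, "friends_countGT1000": 0.3},
    "suspicious": {"friends_countLT500": 0.2, "friends_countGT1000": 0.8},
}
rows = ["friends_countLT500", "friends_countGT1000", "friends_countGT1000"]

alphas = alpha_method(initial, transition, emission, rows)
for t, row in enumerate(alphas, start=1):
    print(t, row)
```

Each printed row plays the role of one entry of alpha.json: the probability of every hidden state at that time slot, which then serves as the POD of the hidden event.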


2.  Probability  of  Flag  States 

Each row of each rule has a probability value in the range 0-1. This probability represents the likelihood of reaching the flag state. Figure 31 shows a list of probabilities for Rule 3. For example, while row 7 has a 47% probability of reaching a flag state, this probability is 100% for row 47. The tool offers an effective way to deal with malicious users and tweets, because defining a threshold and analyzing the data up to this threshold can save time and effort.

Row 7: probability of Flag=0.4729757742430315
Row 8: probability of Flag=0.4729757742430315
Row 9: probability of Flag=0.7603658488632147
Row 10: probability of Flag=0.9140943225309589
Row 11: probability of Flag=0.9296074288513345
Row 12: probability of Flag=0.9742770107832159
Row 13: probability of Flag=0.9742770107832159
Row 14: probability of Flag=0.9905960332250392
Row 15: probability of Flag=0.996912755448593
Row 16: probability of Flag=0.996912755448593
Row 17: probability of Flag=0.998913863436215
Row 18: probability of Flag=0.998913863436215
Row 19: probability of Flag=0.9991182773132586
Row 20: probability of Flag=0.9992909949642688
Row 21: probability of Flag=0.9992909949642688
Row 22: probability of Flag=0.9992909949642688
Row 23: probability of Flag=0.9997553803421885
Row 24: probability of Flag=0.9997553803421885
Row 25: probability of Flag=0.9997553803421885
Row 26: probability of Flag=0.9998024710640785
Row 27: probability of Flag=0.9999345420618497
Row 28: probability of Flag=0.9999797268849568
Row 29: probability of Flag=0.9999937756946659
Row 30: probability of Flag=0.9999937756946659
Row 31: probability of Flag=0.9999937756946659
Row 32: probability of Flag=0.9999937756946659
Row 33: probability of Flag=0.9999979169506338
Row 34: probability of Flag=0.9999993648710289
Row 35: probability of Flag=0.9999993648710289
Row 36: probability of Flag=0.9999993648710289
Row 37: probability of Flag=0.9999997892605491
Row 38: probability of Flag=0.9999999362467353
Row 39: probability of Flag=0.9999999362467353
Row 40: probability of Flag=0.9999999362467353
Row 41: probability of Flag=0.9999999362467353
Row 42: probability of Flag=0.9999999362467353
Row 43: probability of Flag=0.9999999362467353
Row 44: probability of Flag=0.9999999362467353
Row 45: probability of Flag=0.9999999362467353
Row 46: probability of Flag=0.9999999362467353
Row 47: probability of Flag=1.0
Row 48: probability of Flag=1.0
Row 49: probability of Flag=1.0

A list of probability values, one per cycle (CSV file row); each value is the probability of the monitor reaching the Flag state in that cycle.


Figure  31.  Runtime  Monitoring  Rule  3 
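
The idea of working with a threshold can be sketched as follows. The probabilities are copied from a few rows of Figure 31; the threshold value of 0.9 and the function name are hypothetical and serve only as an example of how an analyst might pick the first row that deserves attention.

```python
def first_row_at_or_above(probabilities, threshold):
    """Return the lowest row number whose flag probability reaches the threshold."""
    for row, p in sorted(probabilities.items()):
        if p >= threshold:
            return row
    return None


# Flag probabilities for selected rows of Rule 3 (values from Figure 31).
flag_probability = {
    7: 0.4729757742430315,
    9: 0.7603658488632147,
    10: 0.9140943225309589,
    14: 0.9905960332250392,
    47: 1.0,
}

THRESHOLD = 0.9  # hypothetical cut-off chosen by the analyst
first_row = first_row_at_or_above(flag_probability, THRESHOLD)
print(first_row)
```

With this cut-off, row 10 is the first row whose flag probability reaches the threshold, so the rows before it would not need manual inspection.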
IV. CONCLUSIONS


A.  SUMMARY 

In this thesis, we demonstrated a new technique to perform RM with hidden data. The purpose of the thesis is to determine whether using such a technique to detect malicious tweets and users has the potential to detect patterns of interest more efficiently than has been done so far.

This new technique uses a powerful tool, which is more effective than others because it can handle datasets that include non-observable data. In addition, this technique uses English specifications as the starting point, yet provides unambiguity (through underlying formal specifications) and visual debugging; for example, an English starting-point rule is “Flag whenever event Q occurs fewer than N times between events P and R.”
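
To make the example rule concrete, a deterministic monitor for it can be sketched in a few lines of Python. This is only one possible reading of the English sentence, not the statechart used in the thesis: the function name is hypothetical, and the sketch assumes that a new P restarts the count and that the check is made when R arrives.

```python
def rule_flags(events, n):
    """Return True if Q occurs fewer than n times between some P and the next R."""
    counting = False
    q_count = 0
    for event in events:
        if event == "P":
            # Assumption: every P opens (or restarts) the observation window.
            counting = True
            q_count = 0
        elif event == "Q" and counting:
            q_count += 1
        elif event == "R" and counting:
            if q_count < n:
                return True
            counting = False
    return False


print(rule_flags(["P", "Q", "R"], n=2))
print(rule_flags(["P", "Q", "Q", "R"], n=2))
```

The first sequence contains a single Q between P and R and is flagged for N=2; the second contains two and is not.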

The technique can be used for pattern detection in many different domains, such as detection of fraudulent credit card transactions, traffic light controllers, automated border security and warning systems, and detection of malicious emails, tweets, and messages.

Determining the NL assertions and converting them into a corresponding formal specification language is a problematic area in software engineering. In our technique, UML-based statecharts offer a low learning curve and are intuitive and simple. Automated code generation in the StateRover phase lets the technique combine validation and monitoring of data. Simple implementation, domain independence, and automated code generation with runtime monitoring are the features that differentiate it from other tools.

In social media, although the data flow is voluminous, it is still possible to create an effective system that can detect malicious activities in a brief time and provide situational awareness. The important part of building a reliable system is specifying the event sequences that indicate malicious activity. Considering event sequences allows the system to draw more precise inferences. In this thesis, we specify some event patterns indicating malicious activities according to [6].

Finally, there is no magic system to detect malicious content on Twitter or other social media platforms. However, approaches such as this technique can be used to create new systems that provide better situational awareness.

B.  FUTURE  WORK 

Social  media  analysis  is  very  popular  and  there  are  many  opportunities  for 
extending  the  scope  of  this  thesis. 

In this thesis, we use the relationship between malicious users and six attributes of tweets. These attributes, which are only a small part of all available attributes, are the creation date of the account, the number of friends, the number of followers, whether geolocation is enabled, whether the account is verified, and the tweet text. It is possible to add more attributes for more accurate results. What is necessary is to find different indicators of malicious content and user behavior, then use them with the related attributes in the workflow.

Malicious content and users can be subclassified into categories such as “terrorist” and “fraudulent behavior.” Because these can have different indicators, future studies could focus on either category.

APPENDIX. SCREENSHOTS FROM WORK FLOW


JUNIT  SANITY  TESTS 
1.  Rule-1 


package Rule1;

//import static org.junit.Assert.*;

import junit.framework.TestCase;
import org.junit.Before;
import org.junit.Test;

public class SanityTest {
    Rule1 rule;

    @Before
    public void setup(){
        rule = new Rule1();
    }

    @Test
    public void test() {
        rule.P();
        TestCase.assertFalse(rule.isSuccess());
    }

    @Test
    public void test1() {
        rule.timeoutFire();
        TestCase.assertTrue(rule.isSuccess());
    }

}


Figure  32.  Rule  1  Sanity  Test. 


2. Rule-3


package Rule3;

//import static org.junit.Assert.*;

import junit.framework.TestCase;
import org.junit.Before;
import org.junit.Test;

public class SanityTest {
    Rule3 rule;

    @Before
    public void setup(){
        rule = new Rule3();
    }

    @Test
    public void test() {
        rule.P();
        rule.Q_and_notR();
        rule.P();
        rule.R();
        TestCase.assertFalse(rule.isSuccess());
    }

    @Test
    public void test1() {
        rule.P();
        rule.Q_and_notR();
        rule.R();
        TestCase.assertTrue(rule.isSuccess());
    }

}



Figure  33.  Rule  3  Sanity  Test. 


3.  Rule-9 


package Rule9;

//import static org.junit.Assert.*;

import junit.framework.TestCase;
import org.junit.Before;
import org.junit.Test;

public class SanityTest {
    Rule9 rule;

    @Before
    public void setup(){
        rule = new Rule9();
        rule.T=10;
        rule.execTRreset(); //enforce the setting of T
    }

    @Test
    public void test() {
        rule.incrTime(5);
        rule.E();
        rule.incrTime(15);
        rule.E();
        TestCase.assertTrue(rule.isSuccess()); //E's are more than 30 units apart
    }

    @Test
    public void test1() {
        rule.incrTime(3);
        rule.E();
        rule.incrTime(5);
        rule.E();
        TestCase.assertFalse(rule.isSuccess()); //E's are less than 30 units apart
    }

}

Figure  34.  Rule  9  Sanity  Test. 
4. Rule-19


Figure  35.  Rule  19  Sanity  Test. 


5.  Rule-21 


package Rule21;

//import static org.junit.Assert.*;

import junit.framework.TestCase;
import org.junit.Before;
import org.junit.Test;

public class SanityTest {
    Rule21 rule;

    @Before
    public void setup(){
        rule = new Rule21();
        rule.N=1;
    }

    @Test
    public void test() {
        rule.E();
        rule.Q_and_notE();
        rule.E();
        TestCase.assertTrue(rule.isSuccess());
    }

    @Test
    public void test1() {
        rule.E();
        rule.Q_and_notE();
        rule.Q_and_notE();
        rule.E();
        TestCase.assertFalse(rule.isSuccess());
    }

}

Figure  36.  Rule  21  Sanity  Test. 
D. PYTHON QUANTIZATION SCRIPTS


quantizeUserCreated.py

"""
Script for quantizing the user_created_at column.
The user_created_at column data is provided as an argument: one string separated ...
The Java caller prepares the data this way
The output values are printed to stdout one line at a time
"""
import sys
import os
import datetime
list = sys.argv
#list = ['/Users/PythonQuantizationScripts/quantizeUserCreated.py', '2016-04-04_ ...
#print(len(list))
if (len(list) != 2):
    print("CallError: expecting two arguments (path to this script and a string ...")
    sys.exit(0)

# make a list by splitting the string with _
cells = list[1].split("_")

#quantization
i = datetime.datetime.now()
for icell in cells:
    # in my case this cell includes date with year-month-day format
    year, month, day = map(str, icell.split("-"))
    # icell = year + '-' + month + '-' + day
    # print(day, year, month)
    present = datetime.datetime.now()
    #print(present)
    created_at = datetime.datetime.strptime(icell, '%Y-%m-%d')
    #created_at = datetime.strptime(i.year, i.month, i.day)
    icell2 = str(int(year)+2) + '-' + month + '-' + day
    #after2year represents the date "exactly one year after the creation of the ...
    after2year = datetime.datetime.strptime(icell2, '%Y-%m-%d')
    # If an account was created more than 2 years ago, it is OLD.
    # If an account was created less than 2 years ago, it is NEW.
    if (created_at > present):
        print("CallError: Account should not be created in future.")
        continue
    #if i.year<=0: print("CallError: Account should not be created in future.")
    elif (present > after2year): user_created_at="OLD"
    else: user_created_at="NEW"
    print(user_created_at)



Figure  37.  Quantize  user_created_at  Column. 
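
The same quantization can be written as a pure function that is easier to test, as in the sketch below. It is a restatement under assumptions, not the script used in the thesis: the function names are hypothetical, the reference date is passed in explicitly instead of being read from the clock, and accounts created on February 29 are handled by falling back to February 28, a case the script in Figure 37 does not cover.

```python
import datetime


def quantize_created_at(created, today, years=2):
    """Return "OLD" if the account is more than `years` years old on `today`, else "NEW"."""
    if created > today:
        raise ValueError("account creation date lies in the future")
    try:
        cutoff = created.replace(year=created.year + years)
    except ValueError:
        # created is February 29 and the target year is not a leap year
        cutoff = created.replace(year=created.year + years, day=28)
    return "OLD" if today > cutoff else "NEW"


def quantize_column(argument, today):
    """Quantize an underscore-separated string of YYYY-MM-DD dates."""
    dates = [
        datetime.datetime.strptime(cell, "%Y-%m-%d").date()
        for cell in argument.split("_")
    ]
    return [quantize_created_at(d, today) for d in dates]


today = datetime.date(2016, 6, 1)  # fixed, hypothetical reference date
labels = quantize_column("2009-03-13_2015-01-10_2012-02-29", today)
print(labels)
```

Passing the reference date as a parameter keeps the result reproducible, whereas the original script depends on the day on which it is run.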
quantizeFollowers.py

"""
Script for quantizing the followers_count column.
The followers_count column data is provided as an argument: one string separated ...
Don't worry about this part, the Java caller prepares the data this way
The output values are printed to stdout one line at a time
"""
import sys
import os

list = sys.argv
# list = ['/Users/PythonQuantizationScripts/quantizeUserCreated.py', '20_505_2005 ...
#print(len(list))
if (len(list) != 2):
    print("CallError: expecting two arguments (path to this script and a string ...")
    sys.exit(0)

# make a list by splitting the string with _
cells = list[1].split("_")
#print(cells)
#print(len(cells))

#quantization
#outStr = ""
followers_count = 0
for cell in cells:
    icell = int(cell)
    # in my case, this column displays the number of follower. So there is no ne ...
    if icell<0:
        print("CallError: number of followers cannot be less than zero.")
        continue
    elif icell<500: followers_count="followers_countLT500"
    elif icell<=20000: followers_count="followers_count500to20000"
    else: followers_count="followers_countGT20000"
    print(followers_count)



Figure  38.  Quantize  followers_count  Column. 
quantizeFriends.py

"""
Script for quantizing the friends_count column.
The friends_count column data is provided as an argument: one string separated b ...
The Java caller prepares the data this way
The output values are printed to stdout one line at a time
"""
import sys
import os

list = sys.argv
#list = ['/Users/PythonQuantizationScripts/quantizeUserCreated.py', '20_505_2005 ...
#print(len(list))
if (len(list) != 2):
    print("CallError: expecting two arguments (path to this script and a string ...")
    sys.exit(0)

# make a list by splitting the string with _
cells = list[1].split("_")

#quantization
#outStr = ""
friends_count = 0
for cell in cells:
    #****** THIS IS WHERE YOU MAKE CHANGES TO THE CODE TO REFLECT YOUR QUANTIZAT ...
    icell = int(cell)
    # in my case, this column displays the number of friends. So there is no neg ...
    if icell<0:
        print("CallError: number of friends cannot be less than zero.")
        continue
    elif icell<500: friends_count="friends_countLT500"
    elif icell<=1000: friends_count="friends_count500to1000"
    else: friends_count="friends_countGT1000"
    print(friends_count)



Figure  39.  Quantize  friends_count  Column. 
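
The followers_count and friends_count scripts differ only in the column name and the bucket boundaries, so both can be expressed with one parameterized function. The sketch below is an illustration, not part of the thesis tool chain; the function name and its parameters are hypothetical, while the boundaries and labels are those of Figures 38 and 39.

```python
def quantize_count(value, column, low, high):
    """Map a non-negative count to the bucket labels used by the scripts above."""
    if value < 0:
        raise ValueError(f"{column} cannot be less than zero")
    if value < low:
        return f"{column}LT{low}"
    if value <= high:
        return f"{column}{low}to{high}"
    return f"{column}GT{high}"


# Values taken from the first data row of the runtime csv file (2946 followers, 1506 friends).
print(quantize_count(2946, "followers_count", 500, 20000))
print(quantize_count(1506, "friends_count", 500, 1000))
```

Note that, as in the original scripts, the upper boundary belongs to the middle bucket, so a count of exactly 1000 friends is labeled friends_count500to1000.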
quantization.properties

# Use any name for the script name, but l.h.s must match column name in csv file

# The "followers_count" column quantization (the followers_count column in my csv) script is quantizeFollowers.py
followers_count=quantizeFollowers.py

# The "friends_count" column quantization (the friends_count column in my csv) script is quantizeFriends.py
friends_count=quantizeFriends.py

# The "user_created_at" column quantization (the user_created_at column in my csv) script is quantizeUserCreated.py
user_created_at=quantizeUserCreated.py

# Note that quantization.properties has Python modules only for three columns (followers_count, friend_count, and user_created_at);
# Therefore, other columns will not be used as HMM outputs.
# Stated differently, this is as if the user is saying that the classification of hidden states depends on
# those three columns and not the others.



Figure  40.  Quantization  Properties  File. 
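
The role of the properties file, mapping each csv column to its quantization script, can be illustrated with a small parser. How the Java caller actually reads the file is not shown in this thesis, so the sketch below is only an assumption about the simplest possible reading (key=value lines, with comment lines starting with #); the function name is hypothetical.

```python
def parse_properties(text):
    """Return a dict mapping csv column name -> quantization script name."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, _, value = line.partition("=")
        mapping[key.strip()] = value.strip()
    return mapping


properties_text = """
# Use any name for the script name, but l.h.s must match column name in csv file
followers_count=quantizeFollowers.py
friends_count=quantizeFriends.py
user_created_at=quantizeUserCreated.py
"""

column_to_script = parse_properties(properties_text)
print(column_to_script)
```

Columns that do not appear as keys, such as text, have no script and therefore do not contribute to the HMM outputs.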
E. SANITY TESTS FOR PROBABILISTIC RUNTIME VERIFICATION


1.  Rule-1  Sanity  Test  for  DTRA_Rules 

package Rule1;

//import static org.junit.Assert.*;

import junit.framework.TestCase;
import org.junit.Before;
import org.junit.Test;

public class SanityTest {
    Rule1_DTRA rule;

    @Before
    public void setup(){
        rule = new Rule1_DTRA();
    }

    @Test
    public void test() {
        rule.P(1.0);
        double d = rule.getProbabilityOfSuccess();
        System.out.println("d="+d);
        TestCase.assertFalse(rule.isSuccess());
        // probability 0 of success means probability 1 of flagging -- FLAG
        TestCase.assertEquals(1-d, 1.0);
    }

    @Test
    public void test1() {
        rule.timeoutFire(1.0);
        double d = rule.getProbabilityOfSuccess();
        System.out.println("in test: d="+d);
        TestCase.assertTrue(rule.isSuccess());
        // probability 1 of success means probability 0 of flagging -- NO FLAG
        TestCase.assertEquals(1-d, 0.0);
    }
}


Figure  41.  Sanity  Test  for  Rulel_DTRA. 
2. Rule-3 Sanity Test for DTRA_Rules


package Rule3;

//import static org.junit.Assert.*;

import junit.framework.TestCase;
import org.junit.Before;
import org.junit.Test;

public class SanityTest {
    Rule3_DTRA rule;

    @Before
    public void setup(){
        rule = new Rule3_DTRA();
    }

    @Test
    public void test() {
        rule.P(1.0);
        rule.R(1.0);
        double d = rule.getProbabilityOfSuccess();
        System.out.println("in test: d="+d);
        TestCase.assertFalse(rule.isSuccess());
        // probability 0 of success means probability 1 of flagging
        TestCase.assertEquals(1-d, 1.0);
    }

    @Test
    public void test1() {
        rule.P(1.0);
        rule.Q_and_notR(1.0);
        rule.P(1.0);
        rule.Q_and_notR(1.0);
        rule.R(1.0);
        double d = rule.getProbabilityOfSuccess();
        System.out.println("in test: d="+d);


Figure  42.  Sanity  Test  for  Rule3_DTRA. 
3. Rule-9 Sanity Test for DTRA_Rules


    public void setup(){
        rule = new Rule9_DTRA();
        rule.T=10;
        rule.execTRreset(); //enforce the setting of T
    }

    @Test
    public void test() {
        rule.incrTime(5);
        rule.E(1.0); //add probability 1 to events
        rule.incrTime(15);
        rule.E(1.0); //add probability 1 to events
        double d = rule.getProbabilityOfSuccess();
        System.out.println("in test: d="+d);
        //E's are more than 30 units apart
        TestCase.assertTrue(rule.isSuccess());
        // probability 1 of success means probability 0 of flagging
        TestCase.assertEquals(1-d, 0.0);
    }

    @Test
    public void test1() {
        rule.incrTime(3);
        rule.E(1.0); //add probability 1 to events
        rule.incrTime(5);
        rule.E(1.0); //add probability 1 to events
        double d = rule.getProbabilityOfSuccess();
        System.out.println("in test: d="+d);
        //E's are less than 30 units apart
        TestCase.assertFalse(rule.isSuccess());
        // probability 0 of success means probability 1 of flagging
        TestCase.assertEquals(1-d, 1.0);

Figure  43.  Sanity  Test  for  Rule9_DTRA. 
4. Rule-19 Sanity Test for DTRA_Rules


    Rule19_DTRA rule;

    @Before
    public void setup(){
        rule = new Rule19_DTRA();
        rule.N=2;
        rule.T=10;
        rule.execTRreset(); //enforce the setting of T
    }

    @Test
    public void test() {
        rule.incrTime(1);
        rule.Q(1.0); //add probability 1 to events
        rule.incrTime(1);
        rule.E(1.0);
        rule.incrTime(1);
        rule.E(1.0);
        rule.E(1.0);
        rule.E(1.0);
        rule.incrTime(8);
        rule.timeoutFire(1.0);
        double d = rule.getProbabilityOfSuccess();
        System.out.println("d="+d);
        TestCase.assertFalse(rule.isSuccess());
        // probability 0 of success means probability 1 of flagging -- FLAG
        TestCase.assertEquals(1-d, 1.0);
    }

    @Test
    public void test1() {
        rule.Q(1.0);
        rule.incrTime(3);


Figure  44.  Sanity  Test  for  Rulel9_DTRA. 
5. Rule-21 Sanity Test for DTRA_Rules


public class SanityTest {
    Rule21_DTRA rule;

    @Before
    public void setup(){
        rule = new Rule21_DTRA();
        rule.N=1;
    }

    @Test
    public void test() {
        rule.E(1.0);
        rule.Q_and_notE(1.0);
        rule.E(1.0);
        double d = rule.getProbabilityOfSuccess();
        System.out.println("in test: d="+d);
        TestCase.assertTrue(rule.isSuccess());
        // probability 1 of success means probability 0 of flagging
        TestCase.assertEquals(1-d, 0.0);
    }

    @Test
    public void test1() {
        rule.E(1.0);
        rule.Q_and_notE(1.0);
        rule.Q_and_notE(1.0);
        rule.E(1.0);
        double d = rule.getProbabilityOfSuccess();
        System.out.println("in test: d="+d);
        TestCase.assertFalse(rule.isSuccess());
        // probability 0 of success means probability 1 of flagging
        TestCase.assertEquals(1-d, 1.0);
    }


Figure  45.  Sanity  Test  for  Rule21_DTRA. 
F. COMMANDS IN CMD


Table  7.  Commands. 

Action 

Command 

Generating  hmm.json  file 

java -jar dtrahmm.jar
denemeler\0_denemeHMM_IS_HS.csv

Argument-1: learning phase csv file.

Argument-2:  folder  of  quantization.properties  files 

Run  the  alpha  method 

java  -jar  dtraalpha.jar 

denemeler\0_denemeHMM_NO_IS_HS.csv 

n  PythonQuantizationScripts 
Argument- 1:  runtime  csv  file. 

Argument-2:  hmm.json  file 

Argument-3:  folder  of  quantization.properties  files 

Runtime  monitoring 

java  -jar  dtrarm.jar 

denemeler\0_denemeHMM_NO_IS_HS.csv 
denemeler\alpha.json  DTRA_Rules.jar  Rule3_DTRA 
Rules\bin\Rule3\Rule3_events.properties

Argument-1: runtime csv file

Argument-2: path to alpha.json file

Argument-3: path to DTRA_Rules.jar file

Argument-4: the rule we want to monitor

Argument-5: path to events.properties files

Color code for arguments

Argument-1  Argument-2  Argument-3  Argument-4